In this last analysis section, the focus is on understanding user behavior, so we’ll use an approach based on classification techniques.
As a first step, let’s load and prepare the required data. For the resident identikit we’ll need the binary variable (residenti) that we omitted in the original data preparation module. As before, we’ll split the dataset following the 80/20 rule.
# Load dataset from cache
if (file.exists("Cache/df_features.rds") && file.exists("Cache/sf_gatesGPS.rds")) {
  sf_gatesGPS <- readRDS("Cache/sf_gatesGPS.rds")
  df <- readRDS("Cache/df_features.rds")
  # Map id_varco to its real gate location
  df <- df %>%
    left_join(sf_gatesGPS %>%
                st_drop_geometry() %>%
                dplyr::select(id_amat, label),
              by = c("id_varco" = "id_amat"))
  df$id_varco <- NULL
  df$label <- as.factor(df$label)
  rm(sf_gatesGPS)
  message("Data loaded successfully.")
} else {
  stop("Run step 01 first.")
}
## Data loaded successfully.
# Split data
set.seed(123)
train_index <- sample(1:nrow(df), 0.8 * nrow(df))
train_data <- df[train_index, ]
test_data <- df[-train_index, ]
rm(df, train_index)
# Remove NAs (23 rows in total)
train_data <- na.omit(train_data)
test_data <- na.omit(test_data)
# Transform to factor (LDA and KNN require that)
train_data$residenti <- as.factor(train_data$residenti)
test_data$residenti <- as.factor(test_data$residenti)
train_data$tipologia_alimentazione <- as.factor(train_data$tipologia_alimentazione)
train_data$categoria_veicolo <- as.factor(train_data$categoria_veicolo)
If we look at the available data, we can spot an imbalance between residents and non-residents: only 15% of all the transits are made by residents.
To avoid an unbalanced training set, I decided to re-sample the data taking the imbalance into account: the non-resident class is down-sampled so that the new dataset matches the number of resident transits.
# Verify imbalance
cat("Original dataset:")
## Original dataset:
prop.table(table(train_data$residenti))
##
## FALSE TRUE
## 0.8476233 0.1523767
set.seed(123)
# Data preparation
residenti_true <- train_data[train_data$residenti == TRUE, ]
residenti_false <- train_data[train_data$residenti == FALSE, ]
n_residenti <- nrow(residenti_true) # The sample
# Non resident sampling
non_residenti_sampled <- residenti_false[sample(1:nrow(residenti_false), n_residenti), ]
# New dataset
train_balanced <- rbind(residenti_true, non_residenti_sampled)
rm(residenti_true, residenti_false, non_residenti_sampled)
# Verify equity
cat("New dataset:")
## New dataset:
prop.table(table(train_balanced$residenti))
##
## FALSE TRUE
## 0.5 0.5
What distinguishes a resident? Is it the timestamp? The gate? The vehicle type?
The first research question we’ll answer with classification concerns the resident identikit. This is not a simple task: knowing Milan’s context and looking at the EDA, predicting the habits of a resident is challenging due to their unpredictable behavior. For this reason, the approach I decided to follow for the predictor choice may sound unconventional, but it is the one that can help us the most in identifying a resident.
Alongside the linear one, Logistic Regression is the regression used when the response is boolean. In our case the answer to the question “is the user a resident?” can only be yes or no, so Logistic Regression suits our goal perfectly. With this analysis we’ll work with the Odds, or Logit:
\[\ln\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_nX_n\]

where:

* \(P\) is the probability that the user is a resident
* \(\frac{P}{1-P}\) is the Odds (the ratio between being and not being a resident)
By applying the exponential function to the coefficients (\(\exp(\beta)\)) we’ll be able to see how much each predictor influences the probability of being a resident.
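As a hand-checkable sketch of this interpretation (the two coefficients are taken from the model summary printed later in this section; everything else is plain arithmetic):

```r
# Converting logit coefficients to odds ratios and probabilities by hand.
# -1.87 (intercept) and 1.03 (Petrol) come from the fitted model below.
beta0 <- -1.87                 # baseline log-odds (all predictors at reference level)
beta_petrol <- 1.03            # log-odds shift for a petrol-powered vehicle

odds_ratio <- exp(beta_petrol)            # ~2.80: petrol almost triples the odds
p_baseline <- plogis(beta0)               # ~0.13: baseline probability of being a resident
p_petrol <- plogis(beta0 + beta_petrol)   # ~0.30: probability with a petrol vehicle
```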
The predictors chosen for this analysis are:

* hour: residents’ entrance peaks are related to their lifestyle, which differs from the workers’
* is_weekend: from the EDA we saw that non-resident traffic decreases over the weekends
* tipologia_alimentazione and is_pollutant: since residents have special permissions based on the Euro classification, it may be interesting to have a look; profiling the fuel type also helps us define a more precise identikit
* categoria_veicolo: a resident should rarely access the Area C with a truck…
# Model
logit_model <- glm(residenti ~ hour + is_weekend + tipologia_alimentazione +
categoria_veicolo + is_pollutant,
data = train_balanced,
family = binomial)
summary(logit_model)
##
## Call:
## glm(formula = residenti ~ hour + is_weekend + tipologia_alimentazione +
## categoria_veicolo + is_pollutant, family = binomial, data = train_balanced)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.8700787 0.0091082 -205.319 <2e-16
## hour 0.0517394 0.0002816 183.702 <2e-16
## is_weekendTRUE 0.0714617 0.0034368 20.793 <2e-16
## tipologia_alimentazionePetrol 1.0336620 0.0057726 179.065 <2e-16
## tipologia_alimentazioneDiesel 0.4310577 0.0062391 69.089 <2e-16
## tipologia_alimentazioneElectric -3.5437095 0.0229107 -154.675 <2e-16
## tipologia_alimentazioneGases -0.1219633 0.0072529 -16.816 <2e-16
## tipologia_alimentazioneHybrid 0.1304611 0.0061138 21.339 <2e-16
## tipologia_alimentazioneFuel-oil mixture -5.5362872 26.6643860 -0.208 0.836
## categoria_veicoloOther -3.7542152 0.1650746 -22.743 <2e-16
## categoria_veicoloBus -1.9489438 0.0295483 -65.958 <2e-16
## categoria_veicoloGoods -1.7334523 0.0135072 -128.336 <2e-16
## categoria_veicoloPeople 1.1296062 0.0096518 117.036 <2e-16
## is_pollutantTRUE -1.8256059 0.0088179 -207.035 <2e-16
##
## (Intercept) ***
## hour ***
## is_weekendTRUE ***
## tipologia_alimentazionePetrol ***
## tipologia_alimentazioneDiesel ***
## tipologia_alimentazioneElectric ***
## tipologia_alimentazioneGases ***
## tipologia_alimentazioneHybrid ***
## tipologia_alimentazioneFuel-oil mixture
## categoria_veicoloOther ***
## categoria_veicoloBus ***
## categoria_veicoloGoods ***
## categoria_veicoloPeople ***
## is_pollutantTRUE ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3077859 on 2220205 degrees of freedom
## Residual deviance: 2460638 on 2220192 degrees of freedom
## AIC: 2460666
##
## Number of Fisher Scoring iterations: 6
# Compute odds ratio
odds_ratios <- exp(coef(logit_model))
cat("Odds ratio: ", odds_ratios)
## Odds ratio: 0.1541115 1.053101 1.074077 2.811342 1.538884 0.0289059 0.8851809 1.139354 0.003941132 0.02341882 0.1424244 0.1766734 3.094438 0.16112
Since we’re using Dummy Coding, R has chosen the first level of each factor as the baseline, hiding it behind the Intercept. Let’s print the factor levels:
levels(train_balanced$tipologia_alimentazione)
## [1] "Not Specified" "Petrol" "Diesel" "Electric"
## [5] "Gases" "Hybrid" "Fuel-oil mixture"
levels(train_balanced$categoria_veicolo)
## [1] "Not Specified" "Other" "Bus" "Goods"
## [5] "People"
By looking at the output, we can see that the hidden value is “Not Specified”, so the other coefficients indicate how the probability changes with respect to the “Not Specified” baseline.
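A minimal, self-contained illustration of this dummy coding (toy factor, not the project data):

```r
# R drops the first level of a factor: it becomes the baseline in the intercept.
fuel <- factor(c("Not Specified", "Petrol", "Diesel"),
               levels = c("Not Specified", "Petrol", "Diesel"))
mm <- model.matrix(~ fuel)
colnames(mm)  # "(Intercept)" "fuelPetrol" "fuelDiesel": no "Not Specified" column
```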
table_or <- data.frame(
  Coefficient = c("Intercept",
                  "categoria_veicoloPeople",
                  "tipologia_alimentazionePetrol",
                  "tipologia_alimentazioneDiesel",
                  "hour",
                  "is_pollutant",
                  "tipologia_alimentazioneElectric"),
  RealCategory = c("Baseline",
                   "Car",
                   "Petrol",
                   "Diesel",
                   "Hour",
                   "Pollutant",
                   "Electric"),
  Estimate = c(-1.87, 1.12, 1.03, 0.43, 0.05, -1.82, -3.54),
  OddsRatio = c(0.15, 3.09, 2.81, 1.54, 1.05, 0.16, 0.02),
  Comment = c("Base probability very low",
              "Decent prob. and high influence",
              "Decent prob. and high influence",
              "Small prob and decent influence",
              "EDA confirmed, decent influence",
              "Low prob. and low influence",
              "Really low prob and influence"),
  stringsAsFactors = FALSE)
knitr::kable(
  table_or,
  caption = "Odds Ratio Interpretation",
  align = "l"
)
| Coefficient | RealCategory | Estimate | OddsRatio | Comment |
|---|---|---|---|---|
| Intercept | Baseline | -1.87 | 0.15 | Base probability very low |
| categoria_veicoloPeople | Car | 1.12 | 3.09 | Decent prob. and high influence |
| tipologia_alimentazionePetrol | Petrol | 1.03 | 2.81 | Decent prob. and high influence |
| tipologia_alimentazioneDiesel | Diesel | 0.43 | 1.54 | Small prob and decent influence |
| hour | Hour | 0.05 | 1.05 | EDA confirmed, decent influence |
| is_pollutant | Pollutant | -1.82 | 0.16 | Low prob. and low influence |
| tipologia_alimentazioneElectric | Electric | -3.54 | 0.02 | Really low prob and influence |
rm(table_or, odds_ratios)
Looking at the computed Odds Ratio results, we can assert that being a resident is, by default, the less probable outcome. The characteristics we may want to use for the identikit (plus a brief context explanation) are:
Petrol vehicles and Cars: if the vehicle is a car and it runs on petrol, the probability of being a resident is higher than the baseline. This leads to two options: (i) residents own traditional cars for short journeys, (ii) residents own sports cars (which are usually petrol-powered). Being the most exclusive area of Milan, the cars inside the Cerchia dei Bastioni are probably supercars.
Electric: even if an electric car grants the resident free and unlimited access, electric-powered vehicles are a rarity among residents. This also means that the electric vehicles we saw in the EDA module may be commercial ones (probably for mid-day deliveries).
Pollution is interesting because it has a negative impact on the chances of being a resident, but not an overwhelming one. This may confirm the supercar hypothesis I’ve just made.
Time is still a good predictor, probably due to the afternoon entrance peak we saw in the EDA; with an Odds Ratio of 1.05, each additional hour increases the odds of being a resident by 5%.
From this first analysis we can predict that a resident probably drives a petrol- or diesel-powered car in the afternoon.
From a purely statistical point of view, all the p-values (except the fuel-oil mixture) are extremely low, giving statistical significance to the predictors used. Also, the residual deviance of about 2.46M, compared with the deviance of a blind model (the null deviance), shows a 20% reduction; this confirms that the chosen predictors are appropriate and have a strong discriminant power.
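The 20% figure can be re-derived directly from the two deviances reported in the summary:

```r
# Proportional reduction in deviance (a pseudo-R^2) from the summary above
null_dev <- 3077859       # deviance of the "blind" model (intercept only)
residual_dev <- 2460638   # deviance of the fitted model
reduction <- (null_dev - residual_dev) / null_dev
round(reduction, 3)       # ~0.201, i.e. a ~20% improvement over the null model
```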
With the previous model we were able to understand the probabilities and build a qualitative identikit. The Linear Discriminant Analysis model instead projects the data onto a new dimension, maximizing the separation between the means of residents and non-residents. This marks a border between the two point clouds and fine-tunes the characteristics that make a user a resident.
We’ll print the LD1 coefficients and their values: they indicate how much each variable contributes to marking a user as a resident.
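Concretely, each transit is collapsed into a single discriminant score built from the LD1 coefficients \(w\):

\[z = w_1X_1 + w_2X_2 + ... + w_nX_n\]

The class is then decided by which group centroid the score \(z\) falls closer to, so large positive contributions push a transit toward the resident side.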
lda_model <- lda(residenti ~ hour + is_weekend + is_pollutant +
tipologia_alimentazione + categoria_veicolo,
data = train_balanced)
print(lda_model)
## Call:
## lda(residenti ~ hour + is_weekend + is_pollutant + tipologia_alimentazione +
## categoria_veicolo, data = train_balanced)
##
## Prior probabilities of groups:
## FALSE TRUE
## 0.5 0.5
##
## Group means:
## hour is_weekendTRUE is_pollutantTRUE tipologia_alimentazionePetrol
## FALSE 12.00020 0.2537035 0.08805129 0.2066862
## TRUE 13.93109 0.2887029 0.01670926 0.4907022
## tipologia_alimentazioneDiesel tipologia_alimentazioneElectric
## FALSE 0.2804307 0.087299106
## TRUE 0.1847928 0.001861989
## tipologia_alimentazioneGases tipologia_alimentazioneHybrid
## FALSE 0.10457048 0.1796797
## TRUE 0.06446069 0.1735578
## tipologia_alimentazioneFuel-oil mixture categoria_veicoloOther
## FALSE 9.008173e-07 3.320413e-03
## TRUE 0.000000e+00 3.333024e-05
## categoria_veicoloBus categoria_veicoloGoods categoria_veicoloPeople
## FALSE 0.03098902 0.19748528 0.7107061
## TRUE 0.00124493 0.01105573 0.9699496
##
## Coefficients of linear discriminants:
## LD1
## hour 0.04798548
## is_weekendTRUE 0.06198336
## is_pollutantTRUE -1.46899910
## tipologia_alimentazionePetrol 1.09501077
## tipologia_alimentazioneDiesel 0.39211439
## tipologia_alimentazioneElectric -1.72176794
## tipologia_alimentazioneGases -0.12528517
## tipologia_alimentazioneHybrid 0.14182822
## tipologia_alimentazioneFuel-oil mixture -0.71176605
## categoria_veicoloOther -1.34537824
## categoria_veicoloBus -0.55151852
## categoria_veicoloGoods -0.72120281
## categoria_veicoloPeople 1.18140372
The LDA results are coherent with the Logistic Regression ones: the patterns distinguishing the residents are stable and model-independent.
Quickly looking at the group means, we can spot the expected confirmations: residents transit later in the day (mean hour 13.9 vs 12.0), drive petrol vehicles far more often (0.49 vs 0.21), almost never drive electric (0.002 vs 0.087) or high-pollutant vehicles (0.017 vs 0.088), and almost always use a “People” vehicle (0.97 vs 0.71).
If we want to look at the “direction” of the prediction, i.e. which predictors tend toward the resident class, we can look at the LD1 coefficients.
table_ld1 <- data.frame(
Coefficient = c("categoria_veicoloPeople",
"tipologia_alimentazionePetrol",
"tipologia_alimentazioneElectric",
"is_pollutantTRUE",
"hour"),
RealCategory = c("People (car)",
"Petrol",
"Electric",
"Pollutant",
"Hour"),
LD1_Coefficient = c(1.18,
1.09,
-1.72,
-1.46,
0.04),
Comment = c("Strongest positive discriminant",
"Second strongest pos. discriminant",
"Strongest negative discriminant",
"Decently strong discriminant",
"Constant push toward the afternoon"),
stringsAsFactors = FALSE)
knitr::kable(
table_ld1,
caption = "LD1 Coefficients (LDA)",
align = "l")
| Coefficient | RealCategory | LD1_Coefficient | Comment |
|---|---|---|---|
| categoria_veicoloPeople | People (car) | 1.18 | Strongest positive discriminant |
| tipologia_alimentazionePetrol | Petrol | 1.09 | Second strongest pos. discriminant |
| tipologia_alimentazioneElectric | Electric | -1.72 | Strongest negative discriminant |
| is_pollutantTRUE | Pollutant | -1.46 | Decently strong discriminant |
| hour | Hour | 0.04 | Constant push toward the afternoon |
rm(table_ld1)
Combining the two analyses, we can say that the average Area C resident drives a private car, with a high probability of it being petrol-powered, and tends to drive in the afternoon. Even though electrification is not a trend among residents, their vehicles are rarely high-pollutant ones.
LDA is a rigid linear model that may have missed situations where residents and non-residents overlap (especially at some gates or at specific times). KNN makes no assumptions about the data: it simply checks how close each transit is to the others. If a vehicle at 18:00 is surrounded by residents, it will be classified as a resident.
The predictors chosen for this model are the ones we can consider significant from the other analyses: hour, type of day, type of vehicle, pollution and fuel type.
> [!IMPORTANT]
> IMPORTANT NOTE: KNN uses a ton of RAM (which I don’t have…), so unfortunately I had to downsize the dataset a lot, use only numerical or boolean variables, and scale them. This may have significantly impacted the results and voided their validity during the comparison phase. Another downside is the presence of too many ties, probably due to all the numeric transformations needed; to overcome the issue I’ve used KKNN instead of the classic KNN. KKNN simply assigns more weight to the closest vehicles, removing the ties.
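A toy sketch of the weighted-vote idea behind KKNN (made-up distances and labels; not the actual kknn internals, which apply kernel weights to the k nearest neighbours):

```r
# Nearer neighbours get larger weights, so raw-count ties disappear.
d <- c(0.1, 0.2, 0.9)          # distances to the 3 nearest neighbours
labs <- c(TRUE, TRUE, FALSE)   # their resident labels
w <- pmax(0, 1 - d / max(d))   # triangular-style weight: 1 at distance 0
is_resident <- sum(w[labs]) > sum(w[!labs])  # weighted majority vote -> TRUE
```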
# Data preparation
knn_preparation <- function(data) {
data %>%
mutate(hour_scaled = as.numeric(scale(hour)), #Scale since other vars are 0/1
is_weekend_num = as.numeric(as.logical(is_weekend)),
is_pollutant_num = as.numeric(as.logical(is_pollutant)),
# Bool vars for important predictors
is_diesel = ifelse(tipologia_alimentazione == "Diesel", 1, 0),
is_electric = ifelse(tipologia_alimentazione == "Electric", 1, 0),
is_car = ifelse(categoria_veicolo == "People", 1, 0)) %>%
dplyr::select(hour_scaled, is_weekend_num, is_pollutant_num,
is_diesel, is_electric, is_car)
}
# Samples creation
set.seed(123)
# Train set (30k residents, 30k non)
train_sub <- rbind(train_data[train_data$residenti == TRUE, ] %>% sample_n(30000),
train_data[train_data$residenti == FALSE, ] %>% sample_n(30000))
train_knn_final <- knn_preparation(train_sub)
train_knn_final$residenti <- as.factor(train_sub$residenti)
# Test set (original proportion)
test_sub <- test_data %>% sample_n(100000)
test_knn_final <- knn_preparation(test_sub)
real_target <- as.factor(test_sub$residenti)
# Model
knn_model <- kknn(residenti ~ .,
train = train_knn_final,
test = test_knn_final,
k = 15,
kernel = "triangular")
predizioni_knn <- fitted(knn_model)
KNN does not produce an interpretable model summary (only fitted predictions), so we’ll skip the comment phase.
The comparison between the models will be conducted by computing the Accuracy, the Sensitivity and the Specificity.
As a first step, let’s run the models on the test dataset.
# Logistic Regression
prob_logit <- predict(logit_model, newdata = test_sub, type = "response")
pred_logit <- ifelse(prob_logit > 0.5, TRUE, FALSE)
# LDA
pred_lda <- predict(lda_model, newdata = test_sub)$class
Then let’s extract the metrics and create the comparison table and plot
# Metric extraction
extract_metrics <- function(pred, actual, name) {
cm <- confusionMatrix(factor(pred, levels=c(FALSE, TRUE)),
factor(actual, levels=c(FALSE, TRUE)),
positive = "TRUE")
data.frame(Model = name,
Accuracy = cm$overall[["Accuracy"]],
Sensitivity = cm$byClass[["Sensitivity"]],
Specificity = cm$byClass[["Specificity"]] )}
# Table
comparison_table <- rbind(
extract_metrics(pred_logit, real_target, "Logistic"),
extract_metrics(pred_lda, real_target, "LDA"),
extract_metrics(predizioni_knn, real_target, "KNN"))
print(comparison_table)
## Model Accuracy Sensitivity Specificity
## 1 Logistic 0.62148 0.8266874 0.5840595
## 2 LDA 0.62688 0.8174155 0.5921350
## 3 KNN 0.35883 0.9859301 0.2444754
# Plot
longer_table <- comparison_table %>%
pivot_longer(cols = c(Accuracy, Sensitivity, Specificity),
names_to = "Metric",
values_to = "Value")
comparative_plot <- ggplot(longer_table, aes(x = Metric, y = Value, fill = Model)) +
geom_bar(stat = "identity", position = "dodge", alpha = 0.8) +
geom_text(aes(label = round(Value, 3)),
position = position_dodge(width = 0.9),
vjust = -0.5,
size = 3.5,
color = "black") +
scale_fill_manual(values = c("Logistic" = "#1f77b4",
"LDA" = "#ff7f0e",
"KNN" = "#2ca02c")) +
labs(title = "Model Comparison",
x = "Metric",
y = "Value",
fill = "Model") +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(hjust = 0.5, face = "bold"),
plot.subtitle = element_text(hjust = 0.5),
legend.position = "bottom") +
ylim(0, 1.05)
comp_plot <- ggplotly(comparative_plot)
comp_plot
The Logistic Regression and the LDA are clearly the winners. Considering the difficulties mentioned at the beginning, both are reasonably accurate (over 60%). LDA performed slightly better in terms of accuracy (+0.005), but the Logistic Regression is more capable of identifying residents (sensitivity 0.827 vs 0.817). Thanks to those results we can state that the border between residents and non-residents is approximately linear.
KNN showed an extreme behavior: it identifies almost all the residents (sensitivity 0.986) but its specificity is extremely low (0.244); basically every user is classified as a resident. This phenomenon may be caused by the small distances between residents, especially during peak hours: looking at the peaks, KNN ends up assigning the resident label to everybody. This model can’t be considered when answering the research questions.
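The collapse in accuracy is consistent with the prior mismatch: KNN was trained on a 50/50 sample but tested on a set with roughly 15% residents (an approximation from the earlier proportion table), so the expected accuracy is just the prevalence-weighted mix of sensitivity and specificity:

```r
# Expected accuracy = sens * prevalence + spec * (1 - prevalence)
prevalence <- 0.15               # approximate resident share in the test set
sens <- 0.9859; spec <- 0.2445   # KNN metrics from the comparison table
expected_acc <- sens * prevalence + spec * (1 - prevalence)
round(expected_acc, 3)           # ~0.356, close to the observed 0.359
```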
As a last step, let’s check the confusion matrix. This will help us observe how the models classified every observation within the dataset. We’ll do it for the Logistic Regression model since it is the one that, overall, performed best.
cm_logit <- confusionMatrix(factor(pred_logit, levels=c(FALSE, TRUE)),
factor(real_target, levels=c(FALSE, TRUE)),
positive = "TRUE")
# Plot
fourfoldplot(cm_logit$table,
color = c("#CC6666", "#99CC99"),
conf.level = 0,
margin = 1,
main = "Logistic Regression: Confusion Matrix")
We can clearly see that the errors come from resident over-prediction: the model tends to classify a substantial share of non-residents (specificity of about 0.58) as residents.
The identikit of a resident living inside the Cerchia dei Bastioni is composed of:

* a private car (“People” category), most likely petrol-powered;
* transits concentrated in the afternoon;
* a vehicle that is rarely a high-pollutant one and almost never electric.

The identikit has an accuracy of about 63%.
In what context do the most pollutant vehicles access the most?
This analysis section aims to identify what drives pollutant vehicles to access the Area C gates. To do so we had two candidates: Naive Bayes and Linear Discriminant Analysis. However, we’ll stick with the already consolidated LDA, thanks to its ability to separate the classes while maximizing the distance between them.
# Clean old variables
rm(cm_logit, comp_plot, comparison_table, knn_model, lda_model, logit_model, longer_table, test_knn_final, train_balanced, train_knn_final, train_sub, test_sub, comparative_plot, pred_lda, pred_logit, n_residenti, predizioni_knn, prob_logit, real_target)
# Data preparation
set.seed(123)
train_pollutant <- train_data %>%
group_by(is_pollutant) %>%
sample_n(min(table(train_data$is_pollutant))) %>% # Balance the groups
ungroup()
# Model
lda_pollutant <- lda(is_pollutant ~ hour + is_weekend + residenti +
tipologia_alimentazione + categoria_veicolo,
data = train_pollutant)
print(lda_pollutant)
## Call:
## lda(is_pollutant ~ hour + is_weekend + residenti + tipologia_alimentazione +
## categoria_veicolo, data = train_pollutant)
##
## Prior probabilities of groups:
## FALSE TRUE
## 0.5 0.5
##
## Group means:
## hour is_weekendTRUE residentiTRUE tipologia_alimentazionePetrol
## FALSE 12.34228 0.2562988 0.16312500 0.2621241
## TRUE 11.73386 0.2980682 0.03305374 0.1175690
## tipologia_alimentazioneDiesel tipologia_alimentazioneElectric
## FALSE 0.2135725 0.08063766
## TRUE 0.8824114 0.00000000
## tipologia_alimentazioneGases tipologia_alimentazioneHybrid
## FALSE 0.1060646 0.1937036
## TRUE 0.0000000 0.0000000
## tipologia_alimentazioneFuel-oil mixture categoria_veicoloOther
## FALSE 0.000000e+00 3.234274e-03
## TRUE 1.960166e-05 6.415088e-05
## categoria_veicoloBus categoria_veicoloGoods categoria_veicoloPeople
## FALSE 0.02457157 0.1575617 0.7590047
## TRUE 0.04684262 0.2939750 0.6591165
##
## Coefficients of linear discriminants:
## LD1
## hour -0.0009862814
## is_weekendTRUE 0.1263274905
## residentiTRUE -0.8123581739
## tipologia_alimentazionePetrol 1.3263824477
## tipologia_alimentazioneDiesel 3.1631958842
## tipologia_alimentazioneElectric -0.0902929392
## tipologia_alimentazioneGases -0.0184417559
## tipologia_alimentazioneHybrid 0.0139285390
## tipologia_alimentazioneFuel-oil mixture 3.7763618225
## categoria_veicoloOther -0.9953477287
## categoria_veicoloBus -0.0606315455
## categoria_veicoloGoods -0.0106757748
## categoria_veicoloPeople 0.0861495400
From the conducted analysis we can say that pollutant transits are driven by several factors: the fuel type dominates (Diesel and the rare fuel-oil mixture push strongly toward the pollutant class, with Petrol further behind), being a resident pushes strongly away from it, and the hour is almost irrelevant.
To verify whether the predictions are correct, let’s compute the evaluation metrics using the test dataset.
# Data preparation
test_pollutant <- test_data %>%
dplyr::select(is_pollutant, hour, is_weekend, residenti,
tipologia_alimentazione, categoria_veicolo) %>%
filter(tipologia_alimentazione %in% levels(train_pollutant$tipologia_alimentazione)) %>%
mutate(is_pollutant = as.factor(is_pollutant)) %>%
droplevels()
# Prediction
pred_lda_poll <- predict(lda_pollutant, newdata = test_pollutant)
# Evaluation
cm_lda_poll <- confusionMatrix(pred_lda_poll$class, test_pollutant$is_pollutant, positive = "TRUE")
print(cm_lda_poll$overall['Accuracy'])
## Accuracy
## 0.7930682
print(cm_lda_poll$byClass[c("Sensitivity", "Specificity", "Precision")])
## Sensitivity Specificity Precision
## 0.8813702 0.7856882 0.2557950
The accuracy, sensitivity and specificity at almost 80% are really good results. The 25% precision is explained by the overall low share of pollutant vehicles, which leads the model to assign the pollutant label to many vehicles that are not actually pollutant.
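The low precision can be reproduced with a quick Bayes computation, assuming a pollutant share of roughly 8% in the test set (an approximation consistent with the shares seen elsewhere in the analysis):

```r
# Precision = TP / (TP + FP), rewritten in terms of sensitivity,
# specificity and the prevalence of the positive (pollutant) class.
prev <- 0.08                     # assumed pollutant share in the test set
sens <- 0.8814; spec <- 0.7857   # metrics reported above
precision <- (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))
round(precision, 2)              # ~0.26: a rare class caps precision even for a decent model
```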
Which are the key gates?
During this section, we’ll try to figure out which are the key gates among the 43 available. Originally I wanted to answer this question only from a quantitative perspective (using Lasso), but I thought that also explaining why would be more interesting. For this reason, we’ll use Naive Bayes for this research question. This choice is made because Naive Bayes should work well with the predictors I’ve picked (resident, vehicle type, pollution and day type) and should offer the best results through conditional probability (given characteristic X, which gate will be chosen?).
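Formally, Naive Bayes scores each gate by combining its prior with the class-conditional probabilities of the predictors, assuming they are conditionally independent:

\[P(\text{gate} \mid X_1, ..., X_n) \propto P(\text{gate}) \prod_{i=1}^{n} P(X_i \mid \text{gate})\]

The tables printed below are exactly these \(P(X_i \mid \text{gate})\) terms, one row per gate.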
# Remove old stuff
rm(lda_pollutant, train_pollutant, test_pollutant, pred_lda_poll, cm_lda_poll)
# Choose the top 5 gates.
# Due to the initial sampling, don't sort the gates inside train_data but select them manually from what we saw in the EDA
core_gates_list <- c("TURATI", "VENEZIA", "PORTA VITTORIA", "MASCHERONI", "BOCCACCIO")
# Data preparation
train_nb <- train_data %>%
filter(label %in% core_gates_list) %>%
dplyr::select(label, residenti, categoria_veicolo, is_pollutant, is_weekend) %>%
mutate_all(as.factor) %>%
droplevels()
# Model
nb_model <- naiveBayes(label ~ ., data = train_nb)
print(nb_model$tables)
## $residenti
## residenti
## Y FALSE TRUE
## BOCCACCIO 0.8482499 0.1517501
## MASCHERONI 0.8421060 0.1578940
## PORTA VITTORIA 0.8727879 0.1272121
## TURATI 0.8649199 0.1350801
## VENEZIA 0.8841315 0.1158685
##
## $categoria_veicolo
## categoria_veicolo
## Y Not Specified Other Bus Goods People
## BOCCACCIO 0.034433576 0.002212396 0.068697426 0.187350931 0.707305671
## MASCHERONI 0.044922067 0.001841290 0.039740053 0.167801114 0.745695476
## PORTA VITTORIA 0.050810062 0.013338439 0.040647442 0.196754891 0.698449167
## TURATI 0.050092521 0.003191687 0.046564146 0.189557762 0.710593884
## VENEZIA 0.052635788 0.004497200 0.036214292 0.221578242 0.685074479
##
## $is_pollutant
## is_pollutant
## Y FALSE TRUE
## BOCCACCIO 0.91398157 0.08601843
## MASCHERONI 0.90236423 0.09763577
## PORTA VITTORIA 0.90184745 0.09815255
## TURATI 0.90373203 0.09626797
## VENEZIA 0.89270373 0.10729627
##
## $is_weekend
## is_weekend
## Y FALSE TRUE
## BOCCACCIO 0.7314872 0.2685128
## MASCHERONI 0.7336289 0.2663711
## PORTA VITTORIA 0.7295661 0.2704339
## TURATI 0.7389796 0.2610204
## VENEZIA 0.7442258 0.2557742
The results are coherent with what we expected from the EDA. We have confirmation that using a qualitative approach instead of a quantitative one has led to some interesting details. Among the top 5 gates:

* Mascheroni and Boccaccio show the highest resident shares (about 15-16%), while Venezia has the lowest (11.6%);
* Venezia also leads in goods traffic (about 22%), whereas Mascheroni is the most car-oriented gate (74.6% “People” vehicles);
* Boccaccio stands out for bus transits (6.9%).

The traffic distribution over the weekend is flatter: during Saturday and Sunday the gates have almost the same share (26% on average). The pollution information is not a good discriminant.
Let’s run the just-trained Naive Bayes model over the testing data.
# Data preparation
test_nb <- test_data %>%
filter(label %in% core_gates_list) %>%
dplyr::select(label, residenti, categoria_veicolo, is_pollutant, is_weekend) %>%
mutate_all(as.factor) %>%
droplevels()
# Prediction
pred_nb <- predict(nb_model, test_nb)
# Evaluation
cm_nb <- confusionMatrix(pred_nb, test_nb$label)
print(cm_nb$overall['Accuracy'])
## Accuracy
## 0.2292364
print(cm_nb$byClass[, c("Precision", "Recall")])
## Precision Recall
## Class: BOCCACCIO 0.2431577 0.214159250
## Class: MASCHERONI 0.2893082 0.001243411
## Class: PORTA VITTORIA 0.5389021 0.013722106
## Class: TURATI 0.2257283 0.658151172
## Class: VENEZIA 0.2178576 0.187677994
# Plot
cm_table <- as.data.frame(cm_nb$table)
ggplot(data = cm_table, aes(x = Reference, y = Prediction, fill = Freq)) +
geom_tile(color = "white") +
scale_fill_gradient(low ="#ffb88e", high = "#ea5753") +
geom_text(aes(label = Freq), vjust = 1) +
theme_minimal() +
labs(title = "Confusion Matrix Heatmap",
x = "Real data",
y = "Predicted data") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Unfortunately the accuracy is quite low compared to the other models used in this project. An accuracy of 22%, barely above the 20% random baseline for five classes, tells us that user behavior is very similar among the top 5 gates, making the distinction difficult.
The Turati recall is probably the main cause: looking at the EDA, its traffic is far larger than that of the other 4 main gates. Due to this imbalance the model assigns almost every vehicle to Turati. The plot confirms this Turati-attraction: instead of a diagonal, the darker colors are concentrated on the Turati row.
A nicer result is obtained with Porta Vittoria, where 53% of the transits assigned to that gate are correct (precision), making it the most predictable gate.
Which are the gates preferred by residents?
The final question we’ll answer is related again to the resident identikit. However, this time we’ll take the gate perspective, finding the gates used the most by residents.
Since we’ll work with a categorical predictor (the gate name) and a binary target, we’ll use the Least Absolute Shrinkage and Selection Operator (Lasso). The Lasso can shrink to zero the coefficients of the gates that are irrelevant for our purpose while keeping the interesting ones.
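In formula terms, the Lasso fits a logistic model whose coefficients are penalized by their absolute size, which is what forces the irrelevant gates exactly to zero:

\[\hat{\beta} = \arg\min_{\beta}\left\{-\frac{1}{n}\sum_{i=1}^{n}\ell\left(y_i, \beta_0 + x_i^\top\beta\right) + \lambda\sum_{j=1}^{p}|\beta_j|\right\}\]

where \(\ell\) is the binomial log-likelihood and \(\lambda\) (chosen here via cross-validation) controls how aggressively the gate coefficients are shrunk.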
As before, in the data preparation we have to convert the predictors to a numeric matrix and use the binomial family, since the target is binary.
> [!IMPORTANT]
> IMPORTANT NOTE: like KNN, running the Lasso on a matrix with 43 gates uses a ton of RAM. I chose to sample 100,000 rows (50k residents, 50k non-residents). This may have significantly impacted the results.
# Clean old stuff
rm(cm_nb, cm_table, nb_model, test_nb, train_nb, pred_nb, core_gates_list)
# Sampling
set.seed(123)
train_lasso_sub <- train_data %>%
group_by(residenti) %>%
sample_n(50000) %>%
ungroup()
# Matrix: X (gates) e Y (residents)
X_lasso <- model.matrix(residenti ~ label - 1, data = train_lasso_sub)
Y_lasso <- as.numeric(train_lasso_sub$residenti)
# Model (with cross-validation)
cv_lasso <- cv.glmnet(X_lasso, Y_lasso, family = "binomial", alpha = 1)
coef_lasso <- coef(cv_lasso, s = "lambda.1se") # lambda.1se for a more solid model
# Dataframe
gates_results <- data.frame(gate = rownames(coef_lasso),
coefficient = as.numeric(coef_lasso)) %>%
filter(coefficient != 0 & gate != "(Intercept)") %>%
mutate(gate = gsub("label", "", gate)) %>%
arrange(desc(coefficient))
# Results
print("Positive Coefficients (residential gates):")
## [1] "Positive Coefficients (residential gates):"
print(head(gates_results, 10))
## gate coefficient
## 1 MELEGNANO 0.53513534
## 2 SERVIO TULLIO 0.37473355
## 3 PANZERI 0.37120307
## 4 BIANCA DI SAVOIA 0.26542733
## 5 CABRINI 0.20845261
## 6 AUSONIO 0.19431003
## 7 CURTATONE 0.17784579
## 8 SAN VITTORE 0.17536452
## 9 AURISPA 0.15155063
## 10 PORTA ROMANA 0.08569658
print("Negative Coefficients (non-residential gates):")
## [1] "Negative Coefficients (non-residential gates):"
print(tail(gates_results, 10))
## gate coefficient
## 22 PORTA VITTORIA -0.2000759
## 23 MILTON -0.2193887
## 24 MONFORTE -0.2292678
## 25 VENEZIA -0.2389730
## 26 RONZONI -0.3244251
## 27 MOSCOVA -0.3337213
## 28 PORTA VIGENTINA -1.1715788
## 29 MAGENTA -1.6793358
## 30 LAMARMORA -1.8340362
## 31 ITALIA -2.2323011
#Plot
sf_gatesGPS <- readRDS("Cache/sf_gatesGPS.rds")
gate_geo <- sf_gatesGPS %>%
filter(!(label %in% c("ITALIA", "LAMARMORA", "MAGENTA", "PORTA VIGENTINA"))) %>% # remove the gates with the most extreme coefficients
rename(gate = label) %>%
inner_join(gates_results, by = "gate")
print(gate_geo)
## Simple feature collection with 27 features and 4 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 9.165995 ymin: 45.45229 xmax: 9.206366 ymax: 45.48052
## Geodetic CRS: WGS 84
## First 10 features:
## id_amat gate Location coefficient
## 1 59 MOSCOVA (45.47828296205978, 9.181902909864968) -0.333721315
## 2 60 VOLTA (45.480523187784165, 9.182924909605438) -0.078352562
## 3 62 MILAZZO (45.479920813778165, 9.187838561801035) -0.195859381
## 4 63 CASTELFIDARDO (45.479455867804546, 9.191496914826553) -0.063145618
## 5 64 TURATI (45.47721995713646, 9.195748794193307) -0.075227558
## 6 65 VENEZIA (45.47399965330713, 9.204510521600662) -0.238973025
## 7 66 BARETTI (45.47154536491499, 9.205037675733562) 0.047721758
## 8 67 VITALI (45.47084219094934, 9.205062980382158) 0.009876337
## 9 68 ROSSINI (45.469437339921, 9.20528237775031) -0.110833545
## 10 69 MONFORTE (45.467746462608446, 9.205464017182113) -0.229267815
## geometry
## 1 POINT (9.181903 45.47828)
## 2 POINT (9.182925 45.48052)
## 3 POINT (9.187839 45.47992)
## 4 POINT (9.191497 45.47946)
## 5 POINT (9.195749 45.47722)
## 6 POINT (9.204511 45.474)
## 7 POINT (9.205038 45.47155)
## 8 POINT (9.205063 45.47084)
## 9 POINT (9.205282 45.46944)
## 10 POINT (9.205464 45.46775)
pal <- colorNumeric(
palette = "RdYlGn", # Red-Yellow-Green
domain = gate_geo$coefficient)
leaflet(gate_geo) %>%
addProviderTiles(providers$CartoDB.Positron) %>%
addCircleMarkers(radius = ~abs(coefficient) * 10 + 4,
color = ~pal(coefficient),
stroke = TRUE, fillOpacity = 0.7,
popup = ~paste0("Gate: ", gate, "<br>",
"Residential index: ", round(coefficient, 3), "<br>",
ifelse(coefficient > 0, "More residents", "Less residents"))) %>%
addLegend(pal = pal,
values = ~coefficient,
title = "Preferred gates",
labFormat = labelFormat(prefix = " "),
opacity = 1)
Given the Lasso and map results, we can see a clear distinction in the preferred gates: the gates with positive coefficients (Melegnano, Servio Tullio, Panzeri, Bianca di Savoia, …) are the ones favored by residents, while the strongly negative ones (Italia, Lamarmora, Magenta, Porta Vigentina) are dominated by non-residents.
Switching from Naive Bayes to Lasso has allowed us to fine-tune our research: we went from generic core gates (from Bayes) to isolated, precise gates, thanks to the Lasso’s ability to zero out the less interesting gate predictors.
As a last step, let’s check the goodness of the predictions with the test dataset.
# Data preparation
X_test_lasso <- model.matrix(residenti ~ label - 1, data = test_data)
Y_test_lasso <- as.factor(test_data$residenti)
# Prediction
prob_lasso <- predict(cv_lasso, newx = X_test_lasso, s = "lambda.1se", type = "response")
pred_lasso <- ifelse(prob_lasso > 0.5, TRUE, FALSE)
pred_lasso <- factor(pred_lasso, levels = c("FALSE", "TRUE"))
# Confusion matrix
cm_lasso <- confusionMatrix(pred_lasso, Y_test_lasso, positive = "TRUE")
# Results
print(cm_lasso$overall['Accuracy'])
## Accuracy
## 0.4127804
print(cm_lasso$byClass[c("Sensitivity", "Specificity", "Precision")])
## Sensitivity Specificity Precision
## 0.7462036 0.3527305 0.1719312
An accuracy of 41% for a model that uses only the gate as predictor, on a test set with a strong resident imbalance, is still an informative result; the sensitivity of 0.75 shows that the gate alone captures a meaningful part of resident behavior, even though specificity and precision remain low.